At the time of creating this project (April 2021), 11 of the 20 results which appear on the first page of the BBC News website when searching ‘shark attacks’ mention Australia in the headline or caption. Such information is likely to inform perceptions of how often and how dangerous shark attacks are in Australia. The aim of this project is to use data visualisations to assess whether the availability heuristic influences estimations of how frequently shark attacks occur in Australia, and how many of these shark attacks are fatal.
What is the availability heuristic?
In cognitive psychology, the availability heuristic refers to a cognitive bias (or mental shortcut) which is frequently used when evaluating the likelihood of an event. The easier an example of an event comes to mind, the more common the event is deemed to be, meaning that events which are reported in the news are often perceived as being common occurrences.
The availability heuristic was first discussed by Kahneman and Tversky (1973), and was chosen to be the main focus of this project for two main reasons:
1. Anybody is prone to cognitive biases and everyone is likely to have been influenced by the availability heuristic at some point.
2. Visualisations can be useful for challenging judgments and perceptions when people are otherwise unlikely to question themselves.
Before continuing, how many shark attacks do you think have occurred in the last 50 years off the coast of Australia? Note that here, Australia refers to any mainland state or territory.
A data set has been obtained from Kaggle which includes 16 different variables related to shark attacks around the world, and this data set can be accessed here. The first 5 rows of the data, with the 4 variables relevant to this project, are shown below:
| year | country | area | fatal |
|---|---|---|---|
| 2019 | USA | Florida | N |
| 2019 | USA | Florida | N |
| 2019 | USA | Hawaii | N |
| 2019 | USA | Florida | N |
| 2019 | USA | Hawaii | N |
The following code will demonstrate the data wrangling that was performed in order to prepare the data for visualisation 1.
#See which countries have the most shark attacks
count(shark_data, country, sort=TRUE)
## # A tibble: 206 x 2
## country n
## <chr> <int>
## 1 <NA> 19359
## 2 USA 2310
## 3 AUSTRALIA 1365
## 4 SOUTH AFRICA 583
## 5 PAPUA NEW GUINEA 134
## 6 NEW ZEALAND 133
## 7 BAHAMAS 121
## 8 BRAZIL 114
## 9 MEXICO 94
## 10 ITALY 72
## # … with 196 more rows
#Limit data to the top three countries
shark_data <- shark_data[(shark_data$country=="AUSTRALIA" |
shark_data$country=="USA" |
shark_data$country =="SOUTH AFRICA"), ]
#Specify time span to be plotted
shark_data <- subset(shark_data, year > "1968" & year < "2020")
#Data to plot
country_per_year <- shark_data %>%
group_by(country) %>% count(year)
#Line graph
line_plot <- ggplot(data = country_per_year,
aes(x = as.numeric(year), y = n, group = 1, color = country,
text = paste(
"Year:", as.numeric(year),
"<br>Number of shark attacks:", n,
"<br>Country:", country),
)
) +
geom_line() +
scale_color_manual(values = c("navy", "burlywood1", "sienna2")) +
theme_light() +
ggtitle("Number of shark attacks between 1969-2019 in the three most affected countries") +
xlab("Year") +
ylab("Number of shark attacks") +
theme(text=element_text(family="Georgia")) +
scale_x_continuous(breaks=seq(1950,2020,10))
#Make plot interactive
ggplotly(line_plot, tooltip = "text") %>%
layout(legend = list(orientation = "h", x = 0.225, y = -0.2))
If you were surprised by these results, your estimation may have been influenced by the availability heuristic.
Now that we have looked at the overall trend of shark attacks in Australia and how it compares to other countries, we can look at how many of the attacks resulted in a fatal accident. To do this, we will focus on the year 2018 (as this was the year that the most attacks happened in Australia) and look at how many attacks in each area of Australia resulted in a fatality.
Out of all 39 shark attacks that happened in Australia in 2018, how many would you guess were fatal?
#Subset Australian data into new variable
aus <-
shark_data[shark_data$country == "AUSTRALIA", ]
#Assess the NA data
colSums(is.na(aus))
## year country area fatal
## 0 0 1 47
#Reassign the NA values in 'area' column to 'Area unknown'
aus$area[is.na(aus$area)] <- "Area unknown"
#Keep only the mainland areas and territories, and 'Area unknown'
aus_states <- aus[!(aus$area=="Tasmania" |
aus$area=="Torres Strait" |
aus$area=="Norfolk Island" |
aus$area=="Territory of Cocos (Keeling) Islands"),]
#Check that the remaining data is correct
count(aus_states, area, sort = TRUE)
## # A tibble: 8 x 2
## area n
## <chr> <int>
## 1 New South Wales 208
## 2 Queensland 147
## 3 Western Australia 122
## 4 South Australia 51
## 5 Victoria 41
## 6 Northern Territory 10
## 7 Westerm Australia 3
## 8 Area unknown 1
#Rename data entries which have a typo (##7)
aus_states$area[aus_states$area=="Westerm Australia"] <- "Western Australia"
#Delete excess empty rows of data
aus_states <- aus_states[-c(1300:20608), ]
#Check how many NA's remain
colSums(is.na(aus_states))
## year country area fatal
## 0 0 0 42
#Change all NA's in 'fatal' variable to 'unknown' so that they can be grouped and plotted
aus_states$fatal[is.na(aus_states$fatal)] <- "Unknown"
#Re-code data entries in 'fatal' variable so that they show up as desired in the legend
aus_states$fatal[aus_states$fatal == "UNKNOWN"] <- "Unknown"
aus_states$fatal[aus_states$fatal == "Y"] <- "Yes"
aus_states$fatal[aus_states$fatal == "N"] <- "No"
#Arrange order of data on the x-axis so that 'area unknown' is at the end
plot <- aus_states %>%
mutate(area = fct_relevel(area,
"New South Wales", "Northern Territory", "Queensland", "South Australia", "Victoria", "Western Australia", "Area unknown")) %>%
#Plot the data
ggplot(aes(x = area, y = ..count.., fill = fatal)) +
geom_bar(position = position_dodge(width = 0.5)) +
xlab("Year") +
ylab("Number of shark attacks") +
ggtitle("Number of fatal shark attacks throughout Australia in 2018") +
scale_fill_manual(name = "Fatal",
values = c("lightseagreen", "pink", "brown2")) +
scale_y_continuous(breaks=seq(0,200,20)) +
theme_light()
#Create x-axis labels with line breaks and to match order of histogram bars
labs = c("New South\n Wales", "Northern\nTerritory", "Queensland", "South\nAustralia", "Victoria", "Western\nAustralia", "Area\nunknown")
#More plot aesthetics
plot + scale_x_discrete(labels=labs) +
theme(text = element_text(family="Georgia")) +
guides(fill = guide_legend(reverse=TRUE))
This project aimed to demonstrate how the availability heuristic influences judgments about the frequency and fatality of shark attacks in Australia. The visualisations show that the most attacks in a 12-month period happened in 2018 when 39 attacks occurred, although only a small proportion of these attacks were fatal. The USA, on the other hand, consistently experienced more shark attacks than Australia during this period.
Although these visualisations enable the reader to gain an idea of how many shark attacks take place and how many of them are deadly, no data was available regarding people’s estimates. This meant that, in order to assess the affect of the availability heuristic on judgments, I had to rely on asking the reader to make mental estimates in order for them to assess how accurate their judgements were. To improve this project, data should be collected regarding estimations so that a visualisation can be plotted to provide a direct comparison between estimations and real data in order to assess how accurate judgments are… If estimates are too high, it may suggest that the availability heuristic is at play!